1 -00:00:00,485 --> 00:00:05,515 2 00:00:08,355 --> 00:00:11,357 - Okay, it's after 12, so I think we should get started. 3 00:00:14,644 --> 00:00:17,419 Today we're going to kind of pick up where we left off last time. 4 00:00:17,419 --> 00:00:23,400 Last time we talked about a lot of sort of tips and tricks involved in the nitty gritty details of training neural networks. 5 00:00:23,400 --> 00:00:30,439 Today we'll pick up where we left off, and talk about a lot more of these sort of nitty gritty details about training these things. 6 00:00:30,439 --> 00:00:34,707 As usual, a couple administrative notes before we get into the material. 7 00:00:34,707 --> 00:00:39,645 As you all know, assignment one is already due. Hopefully you all turned it in. 8 00:00:39,645 --> 00:00:57,322 Did it go okay? Was it not okay? Rough sentiment? Mostly okay. Okay, that's good. Awesome. [laughs] We're in the process of grading those, so stay tuned. We're hoping to get grades back for those before A two is due. 9 00:00:57,322 --> 00:01:04,121 Another reminder that your project proposals are due tomorrow. Actually, no, today at 11:59. 10 00:01:04,959 --> 00:01:09,074 Make sure you send those in. Details are on the website and on Piazza. 11 00:01:09,074 --> 00:01:15,269 Also a reminder, assignment two is already out. That'll be due a week from Thursday. 12 00:01:15,269 --> 00:01:25,860 Historically, assignment two has been the longest one in the class, so if you haven't started already on assignment two, I'd recommend you take a look at that pretty soon. 13 00:01:27,122 --> 00:01:32,484 Another reminder is that for assignment two, I think a lot of you will be using Google Cloud. 14 00:01:32,484 --> 00:01:38,586 Big reminder, make sure to stop your instances when you're not using them because whenever your instance is on, you get charged, 15 00:01:38,586 --> 00:01:42,899 and we only have so many coupons to distribute to you guys. 16 00:01:42,899 --> 00:01:52,223 Anytime your instance is on, even if you're not SSH'd into it, even if you're not actively running things in your Jupyter Notebook, any time that instance is on, you're going to be charged. 17 00:01:52,223 --> 00:01:57,118 Just make sure that you explicitly stop your instances when you're not using them. 18 00:01:57,118 --> 00:02:04,970 In this example, I've got a little screenshot of my dashboard on Google Cloud. I need to go in there and explicitly go to the dropdown and click stop. 19 00:02:04,970 --> 00:02:08,644 Just make sure that you do this when you're done working each day. 20 00:02:09,481 --> 00:02:20,853 Another thing to remember is that it's kind of up to you guys to keep track of your spending on Google Cloud. In particular, instances that use GPUs are a lot more expensive than those with CPUs. 21 00:02:20,853 --> 00:02:28,322 Rough order of magnitude, those GPU instances are around 90 cents to a dollar an hour. Those are actually quite pricey. 22 00:02:28,322 --> 00:02:39,739 The CPU instances are much cheaper. The general strategy is that you probably want to make two instances, one with a GPU and one without, and then only use that GPU instance when you really need the GPU. 23 00:02:39,739 --> 00:02:47,377 For example, for most of assignment two, you should only need the CPU, so you should only use your CPU instance for that. 24 00:02:47,377 --> 00:02:52,990 But then the final question, about TensorFlow or PyTorch, will need a GPU.
25 00:02:52,990 --> 00:02:58,897 This'll give you a little bit of practice with switching between multiple instances and only using that GPU when it's really necessary. 26 00:02:58,897 --> 00:03:04,307 Again, just kind of watch your spending. Try not to go too crazy on these things. 27 00:03:04,307 --> 00:03:07,748 Any questions on the administrative stuff before we move on? 28 00:03:11,180 --> 00:03:12,182 Question. 29 00:03:12,182 --> 00:03:13,902 - [Student] How much RAM should we use? 30 00:03:13,902 --> 00:03:16,133 - Question is how much RAM should we use? 31 00:03:16,133 --> 00:03:21,863 I think eight or 16 gigs is probably good for everything that you need in this class. 32 00:03:21,863 --> 00:03:27,114 As you scale up the number of CPUs and the amount of RAM, you also end up spending more money. 33 00:03:27,114 --> 00:03:34,542 If you stick with two or four CPUs and eight or 16 gigs of RAM, that should be plenty for all the homework-related stuff that you need to do. 34 00:03:36,636 --> 00:03:40,417 As a quick recap, last time we talked about activation functions. 35 00:03:40,417 --> 00:03:44,962 We talked about this whole zoo of different activation functions and some of their different properties. 36 00:03:44,962 --> 00:03:59,736 We saw that the sigmoid, which used to be quite popular when training neural networks maybe 10 years ago or so, has this problem with vanishing gradients near the two ends of the activation function. tanh has this similar sort of problem. 37 00:03:59,736 --> 00:04:09,230 Kind of the general recommendation is that you probably want to stick with ReLU for most cases as sort of a default choice 'cause it tends to work well for a lot of different architectures. 38 00:04:09,230 --> 00:04:16,820 We also talked about weight initialization. Remember that up on the top, we have this idea that when you initialize your weights 39 00:04:16,820 --> 00:04:23,787 at the start of training, if those weights are initialized to be too small, then the activations will vanish 40 00:04:23,788 --> 00:04:29,583 as you go through the network because as you multiply by these small numbers over and over again, they'll all sort of decay to zero. 41 00:04:29,583 --> 00:04:33,072 Then everything will be zero, learning won't happen, you'll be sad. 42 00:04:33,072 --> 00:04:41,208 On the other hand, if you initialize your weights too big, then as you go through the network and multiply by your weight matrix over and over again, eventually they'll explode. 43 00:04:41,208 --> 00:04:45,389 You'll be unhappy, there'll be no learning, it will be very bad. 44 00:04:45,389 --> 00:04:58,531 But if you get that initialization just right, for example, using the Xavier initialization or the MSRA initialization, then you kind of keep a nice distribution of activations as you go through the network. 45 00:04:58,531 --> 00:05:04,328 Remember that this kind of gets more and more important and more and more critical as your networks get deeper and deeper 46 00:05:04,328 --> 00:05:11,620 because as your network gets deeper, you're multiplying by those weight matrices over and over again, so you have more and more of these multiplicative terms. 47 00:05:11,620 --> 00:05:23,666 We also talked last time about data preprocessing. We talked about how it's pretty typical in conv nets to zero center and normalize your data so it has zero mean and unit variance. 48 00:05:23,666 --> 00:05:29,968 I wanted to provide a little bit of extra intuition about why you might actually want to do this.
49 00:05:29,968 --> 00:05:39,532 Imagine a simple setup where we have a binary classification problem where we want to draw a line to separate these red points from these blue points. 50 00:05:39,532 --> 00:05:46,948 On the left, you have this idea where if those data points are kind of not normalized and not centered and far away from the origin, 51 00:05:46,948 --> 00:05:55,007 then we can still use a line to separate them, but now if that line wiggles just a little bit, then our classification is going to get totally destroyed. 52 00:05:55,007 --> 00:06:05,992 That kind of means that in the example on the left, the loss function is now extremely sensitive to small perturbations in that linear classifier, in our weight matrix. 53 00:06:07,315 --> 00:06:14,554 We can still represent the same functions, but that might make learning quite difficult because, again, the loss is very sensitive 54 00:06:14,554 --> 00:06:25,351 to our parameter vector, whereas in the situation on the right, if you take that data cloud and you move it into the origin and you make it unit variance, then now, again, we can still 55 00:06:25,351 --> 00:06:35,523 classify that data quite well, but now as we wiggle that line a little bit, then our loss function is less sensitive to small perturbations in the parameter values. 56 00:06:35,523 --> 00:06:41,064 That maybe makes optimization a little bit easier, as we'll see a little bit going forward. 57 00:06:41,064 --> 00:06:46,539 By the way, this situation arises not only in the linear classification case. 58 00:06:46,539 --> 00:06:57,756 Inside a neural network, remember we kind of have these interleavings of these linear matrix multiplies, or convolutions, followed by non-linear activation functions. 59 00:06:59,078 --> 00:07:05,687 If the input to some layer in your neural network is not zero mean and not unit variance, then again, 60 00:07:05,687 --> 00:07:15,632 small perturbations in the weight matrix of that layer of the network could cause large perturbations in the output of that layer, which, again, might make learning difficult. 61 00:07:15,632 --> 00:07:20,481 This is kind of a little bit of extra intuition about why normalization might be important. 62 00:07:21,864 --> 00:07:26,862 Because we have this intuition that normalization is so important, we talked about batch normalization, 63 00:07:26,862 --> 00:07:36,030 which is where we just add this additional layer inside our networks to just force all of the intermediate activations to be zero mean and unit variance. 64 00:07:36,030 --> 00:07:41,465 I've sort of resummarized the batch normalization equations here with the shapes a little bit more explicitly. 65 00:07:41,465 --> 00:07:45,172 Hopefully this can help you out when you're implementing this thing on assignment two. 66 00:07:45,172 --> 00:07:59,254 But again, in batch normalization, we have this idea that in the forward pass, we use the statistics of the mini batch to compute a mean and a standard deviation, and then use those estimates to normalize our data on the forward pass. 67 00:07:59,254 --> 00:08:05,641 Then we also reintroduce the scale and shift parameters to increase the expressivity of the layer. 68 00:08:05,641 --> 00:08:09,990 You might want to refer back to this when working on assignment two. 69 00:08:09,990 --> 00:08:18,146 We also talked last time a little bit about babysitting the learning process, how you should probably be looking at your loss curves during training.
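For reference while working on assignment two, here is a minimal sketch of that batch normalization forward pass at training time. The function name, the small eps constant, and the (N, D) input layout are illustrative assumptions rather than the assignment's required interface, and the test-time behavior with running averages is omitted.

```python
import numpy as np

# Sketch of batch normalization at training time for an input x of shape (N, D).
# gamma and beta are the learned scale and shift parameters, each of shape (D,).
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature mean over the mini batch
    var = x.var(axis=0)                      # per-feature variance over the mini batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # reintroduce scale and shift
```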
70 00:08:18,146 --> 00:08:26,683 Here's an example of some loss curves from networks I was actually training over the weekend. This is usually my setup when I'm working on these things. 71 00:08:26,683 --> 00:08:35,795 On the left, I have some plot showing the training loss over time. You can see it's kind of going down, which means my network is reducing the loss. It's doing well. 72 00:08:35,795 --> 00:08:48,464 On the right, there's this plot where the X axis is, again, time, or the iteration number, and the Y axis is my performance measure both on my training set and on my validation set. 73 00:08:48,465 --> 00:08:58,680 You can see that as we go over time, then my training set performance goes up and up and up and up and up as my loss function goes down, but at some point, my validation set performance kind of plateaus. 74 00:08:58,680 --> 00:09:05,066 This kind of suggests that maybe I'm overfitting in this situation. Maybe I should have been trying to add additional regularization. 75 00:09:06,317 --> 00:09:09,504 We also talked a bit last time about hyperparameter search. 76 00:09:09,504 --> 00:09:14,798 All these networks have sort of a large zoo of hyperparameters. It's pretty important to set them correctly. 77 00:09:14,798 --> 00:09:20,725 We talked a little bit about grid search versus random search, and how random search is maybe a little bit nicer in theory 78 00:09:20,725 --> 00:09:30,669 because in the situation where your performance might be more sensitive with respect to one hyperparameter than another, random search lets you cover that space a little bit better. 79 00:09:30,669 --> 00:09:37,005 We also talked about the idea of coarse to fine search, where when you're doing this hyperparameter optimization, probably you 80 00:09:37,005 --> 00:09:43,408 want to start with very wide ranges for your hyperparameters, only train for a couple iterations, and then based on 81 00:09:43,408 --> 00:09:47,973 those results, you kind of narrow in on the range of hyperparameters that are good. 82 00:09:47,973 --> 00:09:51,666 Now, again, redo your search in a smaller range for more iterations. 83 00:09:51,666 --> 00:09:56,708 You can kind of iterate this process to kind of hone in on the right region for hyperparameters. 84 00:09:56,708 --> 00:10:04,455 But again, it's really important to, at the start, have a very coarse range to start with, where you want very, very wide ranges for all your hyperparameters. 85 00:10:04,455 --> 00:10:13,746 Ideally, those ranges should be so wide that your network is kind of blowing up at either end of the range so that you know that you've searched a wide enough range for those things. 86 00:10:17,462 --> 00:10:18,295 Question? 87 00:10:20,044 --> 00:10:26,672 - [Student] How many [speaks too low to hear] optimize at once? [speaks too low to hear] 88 00:10:31,840 --> 00:10:34,554 - The question is how many hyperparameters do we typically search at a time? 89 00:10:34,554 --> 00:10:38,244 Here it's two, but there are typically a lot more than two in these things. 90 00:10:38,244 --> 00:10:45,442 It kind of depends on the exact model and the exact architecture, but because the number of possibilities is exponential in the number of hyperparameters, 91 00:10:45,442 --> 00:10:48,012 you can't really test too many at a time. 92 00:10:48,012 --> 00:10:51,737 It also kind of depends on how many machines you have available. 93 00:10:51,737 --> 00:10:55,745 It kind of varies from person to person and from experiment to experiment.
94 00:10:55,745 --> 00:11:05,353 But generally, I try not to do this over more than maybe two or three or four at a time at most because, again, this exponential search just gets out of control. 95 00:11:05,353 --> 00:11:10,406 Typically, learning rate is the really important one that you need to nail first. 96 00:11:10,406 --> 00:11:19,542 Then other things, like regularization, like learning rate decay, model size, these other types of things tend to be a little bit less sensitive than learning rate. 97 00:11:19,542 --> 00:11:22,723 Sometimes you might do kind of a block coordinate descent, where you go and find 98 00:11:22,723 --> 00:11:27,459 the good learning rate, then you go back and try to look at different model sizes. 99 00:11:27,459 --> 00:11:30,759 This can help you cut down on the exponential search a little bit, 100 00:11:30,759 --> 00:11:35,370 but it's a little bit problem dependent on exactly which ones you should be searching over in which order. 101 00:11:36,253 --> 00:11:38,120 More questions? 102 00:11:38,120 --> 00:11:57,041 - [Student] [speaks too low to hear] Another parameter, but then changing that other parameter, two or three other parameters, makes it so that your learning rate or the ideal learning rate is still [speaks too low to hear]. 103 00:11:57,041 --> 00:12:04,537 - Question is how often does it happen where when you change one hyperparameter, then the other, the optimal values of the other hyperparameters change? 104 00:12:04,537 --> 00:12:11,339 That does happen sometimes, although for learning rates, that's typically less of a problem. 105 00:12:11,339 --> 00:12:18,130 For learning rates, typically you want to get in a good range, and then set it maybe even a little bit lower than optimal, and let it go for a long time. 106 00:12:18,130 --> 00:12:31,291 Then if you do that, combined with some of the fancier optimization strategies that we'll talk about today, then a lot of models tend to be a little bit less sensitive to learning rate once you get them in a good range. 107 00:12:31,291 --> 00:12:32,962 Sorry, did you have a question in front, as well? 108 00:12:32,962 --> 00:12:37,308 - [Student] [speaks too low to hear] 109 00:12:37,308 --> 00:12:41,292 - The question is what's wrong with having a small learning rate and increasing the number of epochs? 110 00:12:41,292 --> 00:12:45,139 The answer is that it might take a very long time. [laughs] 111 00:12:45,139 --> 00:12:48,383 - [Student] [speaks too low to hear] 112 00:12:48,383 --> 00:12:54,853 - Intuitively, if you set the learning rate very low and let it go for a very long time, then this should, in theory, always work. 113 00:12:54,853 --> 00:13:00,491 But in practice, those factors of 10 or 100 actually matter a lot when you're training these things. 114 00:13:00,491 --> 00:13:03,931 Maybe if you got the right learning rate, you could train it in six hours, 12 hours 115 00:13:03,931 --> 00:13:11,911 or a day, but then if you just were super safe and dropped it by a factor of 10 or by a factor of 100, now that one-day training becomes 100 days of training. 116 00:13:11,911 --> 00:13:16,400 That's three months. That's not going to be good. 117 00:13:16,400 --> 00:13:20,668 When you're taking these intro computer science classes, they always kind of sweep the constants under the rug, but when 118 00:13:20,668 --> 00:13:25,444 you're actually thinking about training things, those constants end up mattering a lot. 119 00:13:25,444 --> 00:13:26,861 Another question? 
120 00:13:27,877 --> 00:13:33,385 - [Student] If you have a low learning rate, [speaks too low to hear]. 121 00:13:33,385 --> 00:13:37,807 - Question is for a low learning rate, are you more likely to be stuck in local optima? 122 00:13:37,807 --> 00:13:42,601 I think that makes some intuitive sense, but in practice, that seems not to be much of a problem. 123 00:13:42,601 --> 00:13:47,030 I think we'll talk a bit more about that later today. 124 00:13:47,030 --> 00:13:53,151 Today I wanted to talk about a couple other really interesting and important topics when we're training neural networks. 125 00:13:53,151 --> 00:13:59,655 In particular, we've kind of alluded to fancier, more powerful optimization algorithms a couple of times. 126 00:13:59,655 --> 00:14:07,067 I wanted to spend some time today and really dig into those and talk about what are the actual optimization algorithms that most people are using these days. 127 00:14:07,067 --> 00:14:10,364 We also touched on regularization in earlier lectures. 128 00:14:10,364 --> 00:14:15,806 This is the concept of making your network do additional things to reduce the gap between train and test error. 129 00:14:15,806 --> 00:14:22,143 I wanted to talk about some more strategies that people are using in practice for regularization, with respect to neural networks. 130 00:14:22,143 --> 00:14:26,401 Finally, I also wanted to talk a bit about transfer learning, where you can 131 00:14:26,401 --> 00:14:31,490 sometimes get away with using less data than you think by transferring from one problem to another. 132 00:14:32,821 --> 00:14:39,885 If you recall from a few lectures ago, the kind of core strategy in training neural networks is an optimization problem 133 00:14:39,885 --> 00:14:50,982 where we write down some loss function which, for each value of the network weights, tells us how well or how poorly that value of the weights is doing on our problem. 134 00:14:50,982 --> 00:14:56,508 Then we imagine that this loss function gives us some nice landscape over the weights, 135 00:14:56,508 --> 00:15:04,142 where on the right, I've shown this maybe small, two-dimensional problem, where the X and Y axes are two values of the weights. 136 00:15:04,142 --> 00:15:07,984 Then the color of the plot kind of represents the value of the loss. 137 00:15:07,984 --> 00:15:15,195 In this kind of cartoon picture of a two-dimensional problem, we're only optimizing over these two values, W one, W two. 138 00:15:15,195 --> 00:15:23,203 The goal is to find the most red region in this case, which corresponds to the setting of the weights with the lowest loss. 139 00:15:23,203 --> 00:15:29,099 Remember, we've been working so far with this extremely simple optimization algorithm, stochastic gradient descent, 140 00:15:29,099 --> 00:15:32,393 where it's super simple, it's three lines. 141 00:15:32,393 --> 00:15:39,179 While true, we first evaluate the loss and the gradient on some mini batch of data. 142 00:15:39,179 --> 00:15:44,656 Then we step, updating our parameter vector in the negative direction of the gradient 143 00:15:44,656 --> 00:15:48,798 because this gives, again, the direction of greatest decrease of the loss function. 144 00:15:48,798 --> 00:15:56,282 Then we repeat this over and over again, and hopefully we converge to the red region and we get low error and we're very happy.
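As a rough sketch, that simple loop looks something like this; sample_minibatch and evaluate_gradient are hypothetical placeholders standing in for your data loading and loss/gradient computation, not real library functions.

```python
def sgd(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
        learning_rate=1e-2, num_iterations=1000):
    # Vanilla minibatch SGD: estimate the gradient on a small batch of data,
    # then step in the negative gradient direction.
    for _ in range(num_iterations):
        batch = sample_minibatch(training_data, batch_size=256)
        grad = evaluate_gradient(loss_fn, batch, weights)
        weights = weights - learning_rate * grad
    return weights
```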
145 00:15:56,282 --> 00:16:05,462 But unfortunately, this relatively simple optimization algorithm has quite a lot of problems that actually could come up in practice. 146 00:16:05,462 --> 00:16:08,713 One problem with stochastic gradient descent, 147 00:16:08,713 --> 00:16:18,969 imagine what happens if our objective function looks something like this, where, again, we're plotting two values, W one and W two. 148 00:16:18,969 --> 00:16:23,472 As we change one of those values, the loss function changes very slowly. 149 00:16:23,472 --> 00:16:26,687 As we change the horizontal value, then our loss changes slowly. 150 00:16:28,152 --> 00:16:34,930 As we go up and down in this landscape, now our loss is very sensitive to changes in the vertical direction. 151 00:16:34,930 --> 00:16:40,757 By the way, this is referred to as the loss having a bad condition number at this point, 152 00:16:40,757 --> 00:16:46,050 which is the ratio between the largest and smallest singular values of the Hessian matrix at that point. 153 00:16:46,050 --> 00:16:50,497 But the intuitive idea is that the loss landscape kind of looks like a taco shell. 154 00:16:50,497 --> 00:16:54,393 It's sort of very sensitive in one direction, not sensitive in the other direction. 155 00:16:54,393 --> 00:17:00,633 The question is what might SGD, stochastic gradient descent, do on a function that looks like this? 156 00:17:05,310 --> 00:17:12,196 If you run stochastic gradient descent on this type of function, you might get this characteristic zigzagging behavior, 157 00:17:12,197 --> 00:17:22,111 where because for this type of objective function, the direction of the gradient does not align with the direction towards the minima. 158 00:17:22,112 --> 00:17:29,335 When you compute the gradient and take a step, you might step sort of over this line and sort of zigzag back and forth. 159 00:17:29,335 --> 00:17:35,995 In effect, you get very slow progress along the horizontal dimension, which is the less sensitive dimension, and you get this 160 00:17:35,995 --> 00:17:41,551 zigzagging, nasty, nasty zigzagging behavior across the fast-changing dimension. 161 00:17:41,551 --> 00:17:50,139 This is undesirable behavior. By the way, this problem actually becomes much more common in high dimensions. 162 00:17:51,186 --> 00:18:00,617 In this kind of cartoon picture, we're only showing a two-dimensional optimization landscape, but in practice, our neural networks might have millions, tens of millions, hundreds of millions of parameters. 163 00:18:00,617 --> 00:18:14,221 That's hundreds of millions of directions along which this thing can move. Now among those hundreds of millions of different directions to move, if the ratio between the largest one and the smallest one is bad, then SGD will not perform so nicely. 164 00:18:14,221 --> 00:18:20,573 You can imagine that if we have 100 million parameters, probably the maximum ratio between those two will be quite large. 165 00:18:20,573 --> 00:18:26,398 I think this is actually quite a big problem in practice for many high-dimensional problems. 166 00:18:27,793 --> 00:18:33,564 Another problem with SGD has to do with this idea of local minima or saddle points. 167 00:18:33,564 --> 00:18:44,003 Here I've sort of swapped the graph a little bit. Now the X axis is showing the value of one parameter, and then the Y axis is showing the value of the loss. 168 00:18:44,003 --> 00:18:51,583 In this top example, we have kind of this curvy objective function, where there's a valley in the middle. 
169 00:18:51,583 --> 00:18:55,036 What happens to SGD in this situation? 170 00:18:55,036 --> 00:18:57,031 - [Student] [speaks too low to hear] 171 00:18:57,031 --> 00:19:04,454 - In this situation, SGD will get stuck because at this local minima, the gradient is zero because it's locally flat. 172 00:19:04,454 --> 00:19:09,194 Now remember with SGD, we compute the gradient and step in the direction of opposite gradient, 173 00:19:09,194 --> 00:19:15,862 so if at our current point, the opposite gradient is zero, then we're not going to make any progress, and we'll get stuck at this point. 174 00:19:15,862 --> 00:19:19,406 There's another problem with this idea of saddle points. 175 00:19:19,406 --> 00:19:26,140 Rather than being a local minima, you can imagine a point where in one direction we go up, and in the other direction we go down. 176 00:19:26,140 --> 00:19:28,953 Then at our current point, the gradient is zero. 177 00:19:28,953 --> 00:19:35,899 Again, in this situation, the function will get stuck at the saddle point because the gradient is zero. 178 00:19:35,899 --> 00:19:48,122 Although one thing I'd like to point out is that in one dimension, in a one-dimensional problem like this, local minima seem like a big problem and saddle points seem like kind of not something to worry about, but in fact, 179 00:19:48,122 --> 00:19:57,171 it's the opposite once you move to very high-dimensional problems because, again, if you think about you're in this 100 million dimensional space, what does a saddle point mean? 180 00:19:57,171 --> 00:20:03,135 That means that at my current point, some directions the loss goes up, and some directions the loss goes down. 181 00:20:03,135 --> 00:20:09,591 If you have 100 million dimensions, that's probably going to happen more frequently than, that's probably going to happen almost everywhere, basically. 182 00:20:09,591 --> 00:20:16,744 Whereas a local minima says that of all those 100 million directions that I can move, every one of them causes the loss to go up. 183 00:20:16,744 --> 00:20:22,316 In fact, that seems pretty rare when you're thinking about, again, these very high-dimensional problems. 184 00:20:23,270 --> 00:20:33,283 Really, the idea that has come to light in the last few years is that when you're training these very large neural networks, the problem is more about saddle points and less about local minima. 185 00:20:33,283 --> 00:20:40,140 By the way, this also is a problem not just exactly at the saddle point, but also near the saddle point. 186 00:20:40,140 --> 00:20:47,935 If you look at the example on the bottom, you see that in the regions around the saddle point, the gradient isn't zero, but the slope is very small. 187 00:20:47,935 --> 00:20:53,611 That means that if we're, again, just stepping in the direction of the gradient, and that gradient is very small, we're going to make 188 00:20:53,611 --> 00:21:01,872 very, very slow progress whenever our current parameter value is near a saddle point in the objective landscape. 189 00:21:01,872 --> 00:21:10,115 This is actually a big problem. Another problem with SGD comes from the S. 190 00:21:10,115 --> 00:21:13,521 Remember that SGD is stochastic gradient descent. 191 00:21:13,521 --> 00:21:20,586 Recall that our loss function is typically defined by computing the loss over many, many different examples. 192 00:21:20,586 --> 00:21:26,119 In this case, if N is your whole training set, then that could be something like a million. 
193 00:21:26,119 --> 00:21:29,347 Each time computing the loss would be very, very expensive. 194 00:21:29,347 --> 00:21:36,957 In practice, remember that we often estimate the loss and estimate the gradient using a small mini batch of examples. 195 00:21:36,957 --> 00:21:42,148 What this means is that we're not actually getting the true information about the gradient at every time step. 196 00:21:42,148 --> 00:21:46,773 Instead, we're just getting some noisy estimate of the gradient at our current point. 197 00:21:46,773 --> 00:21:50,575 Here on the right, I've kind of faked this plot a little bit. 198 00:21:50,575 --> 00:21:59,927 I've just added random uniform noise to the gradient at every point, and then run SGD with these noisy, messed up gradients. 199 00:21:59,927 --> 00:22:07,987 This is maybe not exactly what happens with the SGD process, but it still gives you the sense that if there's noise in your gradient estimates, then vanilla SGD 200 00:22:07,987 --> 00:22:14,036 kind of meanders around the space and might actually take a long time to get towards the minima. 201 00:22:15,723 --> 00:22:18,966 Now that we've talked about a lot of these problems. 202 00:22:18,966 --> 00:22:20,956 Sorry, was there a question? 203 00:22:20,956 --> 00:22:25,123 - [Student] [speaks too low to hear] 204 00:22:29,099 --> 00:22:34,435 - The question is do all of these just go away if we use normal gradient descent? 205 00:22:35,281 --> 00:22:44,106 Let's see. I think that the taco shell problem of high condition numbers is still a problem with full batch gradient descent. 206 00:22:44,106 --> 00:22:54,120 The noise. As we'll see, we might sometimes introduce additional noise into the network, not only due to sampling mini batches, but also due to explicit stochasticity in the network, 207 00:22:54,120 --> 00:22:57,736 so we'll see that later. That can still be a problem. 208 00:22:57,736 --> 00:23:05,101 Saddle points, that's still a problem for full batch gradient descent because there can still be saddle points in the full objective landscape. 209 00:23:05,101 --> 00:23:10,249 Basically, even if we go to full batch gradient descent, it doesn't really solve these problems. 210 00:23:10,249 --> 00:23:16,604 We kind of need to think about a slightly fancier optimization algorithm that can try to address these concerns. 211 00:23:16,604 --> 00:23:21,966 Thankfully, there's a really, really simple strategy that works pretty well at addressing many of these problems. 212 00:23:21,966 --> 00:23:26,978 That's this idea of adding a momentum term to our stochastic gradient descent. 213 00:23:26,978 --> 00:23:32,923 Here on the left, we have our classic old friend, SGD, where we just always step in the direction of the gradient. 214 00:23:32,923 --> 00:23:43,062 But now on the right, we have this minor, minor variant called SGD plus momentum, which is now two equations and five lines of code, so it's twice as complicated. 215 00:23:43,062 --> 00:23:51,331 But it's very simple. The idea is that we maintain a velocity over time, and we add our gradient estimates to the velocity. 216 00:23:51,331 --> 00:23:57,811 Then we step in the direction of the velocity, rather than stepping in the direction of the gradient. 217 00:23:57,811 --> 00:24:04,825 This is very, very simple. We also have this hyperparameter rho now which corresponds to friction.
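Written out in the same style as the SGD sketch above, those two equations look roughly like this; again, sample_minibatch and evaluate_gradient are hypothetical placeholders.

```python
def sgd_momentum(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
                 learning_rate=1e-2, rho=0.9, num_iterations=1000):
    # SGD with momentum: keep a running velocity, decay it by the friction
    # constant rho, add in the current gradient, then step along the velocity
    # instead of the raw gradient.
    velocity = 0
    for _ in range(num_iterations):
        batch = sample_minibatch(training_data, batch_size=256)
        grad = evaluate_gradient(loss_fn, batch, weights)
        velocity = rho * velocity + grad
        weights = weights - learning_rate * velocity
    return weights
```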
218 00:24:05,925 --> 00:24:16,848 Now at every time step, we take our current velocity, we decay the current velocity by the friction constant, rho, which is often something high, like .9 is a common choice. 219 00:24:16,848 --> 00:24:21,173 We take our current velocity, we decay it by friction and we add in our gradient. 220 00:24:21,173 --> 00:24:26,999 Now we step in the direction of our velocity vector, rather than the direction of our raw gradient vector. 221 00:24:28,327 --> 00:24:34,548 This super, super simple strategy actually helps for all of these problems that we just talked about. 222 00:24:34,548 --> 00:24:44,809 If you think about what happens at local minima or saddle points, then if we're imagining velocity in this system, then you kind of have this physical interpretation of this ball 223 00:24:44,809 --> 00:24:48,215 kind of rolling down the hill, picking up speed as it comes down. 224 00:24:48,215 --> 00:24:56,922 Now once we have velocity, then even when we pass that point of local minima, the point will still have velocity, even if it doesn't have gradient. 225 00:24:56,922 --> 00:25:01,039 Then we can hopefully get over this local minima and continue downward. 226 00:25:01,039 --> 00:25:03,809 There's this similar intuition near saddle points, 227 00:25:03,809 --> 00:25:10,734 where even though the gradient around the saddle point is very small, we have this velocity vector that we've built up as we roll downhill. 228 00:25:10,734 --> 00:25:16,462 That can hopefully carry us through the saddle point and let us continue rolling all the way down. 229 00:25:16,462 --> 00:25:21,949 If you think about what happens in poor conditioning, now if we were to have 230 00:25:21,949 --> 00:25:31,105 these kind of zigzagging approximations to the gradient, then those zigzags will hopefully cancel each other out pretty fast once we're using momentum. 231 00:25:31,105 --> 00:25:46,006 This will effectively reduce the amount by which we step in the sensitive direction, whereas in the horizontal direction, our velocity will just keep building up, and will actually accelerate our descent across that less sensitive dimension. 232 00:25:46,006 --> 00:25:51,344 Adding momentum here can actually help us with this high condition number problem, as well. 233 00:25:51,344 --> 00:26:05,207 Finally, on the right, we've repeated the same visualization of gradient descent with noise. Here, the black is this vanilla SGD, which is sort of zigzagging all over the place, where the blue line is showing now SGD with momentum. 234 00:26:05,207 --> 00:26:12,644 You can see that because we're building up this velocity over time, the noise kind of gets averaged out in our gradient estimates. 235 00:26:12,644 --> 00:26:20,337 Now SGD momentum ends up taking a much smoother path towards the minima, compared with vanilla SGD, which is kind of meandering due to noise. 236 00:26:20,337 --> 00:26:21,532 Question? 237 00:26:21,532 --> 00:26:25,699 - [Student] [speaks too low to hear] 238 00:26:34,776 --> 00:26:40,465 - The question is how does SGD momentum help with the poorly conditioned coordinate? 239 00:26:40,465 --> 00:26:49,125 The idea is that if you go back and look at this velocity estimate and look at the velocity computation, we're adding in the gradient at every time step.
240 00:26:49,125 --> 00:26:56,603 It kind of depends on your setting of rho, that hyperparameter, but you can imagine that if the gradient is relatively small, 241 00:26:56,603 --> 00:26:59,254 and if rho is well behaved in this situation, 242 00:26:59,254 --> 00:27:05,512 then our velocity could actually monotonically increase up to a point where the velocity could now be larger than the actual gradient. 243 00:27:05,512 --> 00:27:10,377 Then we might actually make faster progress along the poorly conditioned dimension. 244 00:27:12,569 --> 00:27:18,020 Kind of one picture that you can have in mind when we're doing SGD plus momentum is that 245 00:27:18,020 --> 00:27:20,273 the red here is our current point. 246 00:27:20,273 --> 00:27:30,075 At our current point, we have some red vector, which is the direction of the gradient, or rather our estimate of the gradient at the current point. Green is now the direction of our velocity vector. 247 00:27:30,075 --> 00:27:36,317 Now when we do the momentum update, we're actually stepping according to a weighted average of these two. 248 00:27:36,317 --> 00:27:40,049 This helps overcome some noise in our gradient estimate. 249 00:27:40,049 --> 00:27:47,724 There's a slight variation of momentum that you sometimes see, called Nesterov accelerated gradient, also sometimes called Nesterov momentum. 250 00:27:47,724 --> 00:27:51,737 That switches up this order of things a little bit. 251 00:27:51,737 --> 00:28:00,285 In sort of normal SGD momentum, we imagine that we estimate the gradient at our current point, and then take a mix of our velocity and our gradient. 252 00:28:00,285 --> 00:28:04,229 With Nesterov accelerated gradient, you do something a little bit different. 253 00:28:04,229 --> 00:28:10,765 Here, you start at the red point. You step in the direction of where the velocity would take you. 254 00:28:10,765 --> 00:28:18,732 You evaluate the gradient at that point. Then you go back to your original point and kind of mix together those two. 255 00:28:18,732 --> 00:28:25,679 This is kind of a funny interpretation, but you can imagine that you're kind of mixing together information a little bit more. 256 00:28:25,679 --> 00:28:34,702 If your velocity direction was actually a little bit wrong, it lets you incorporate gradient information from a little bit larger parts of the objective landscape. 257 00:28:34,702 --> 00:28:39,351 This also has some really nice theoretical properties when it comes to convex optimization, 258 00:28:39,351 --> 00:28:45,946 but those guarantees go a little bit out the window once it comes to non-convex problems like neural networks. 259 00:28:45,946 --> 00:28:51,061 Writing it down in equations, Nesterov momentum looks something like this, where now 260 00:28:51,061 --> 00:28:57,155 to update our velocity, we take a step, according to our previous velocity, and evaluate that gradient there. 261 00:28:57,155 --> 00:29:06,222 Now when we take our next step, we actually step in the direction of our velocity that's incorporating information from these multiple points. 262 00:29:06,222 --> 00:29:07,055 Question? 263 00:29:08,437 --> 00:29:12,357 - [Student] [speaks too low to hear] - Oh, sorry. 264 00:29:12,357 --> 00:29:14,743 The question is what's a good initialization for the velocity? 265 00:29:14,743 --> 00:29:16,998 This is almost always zero. 266 00:29:16,998 --> 00:29:20,096 It's not even a hyperparameter. Just set it to zero and don't worry. 267 00:29:20,096 --> 00:29:21,315 Another question? 
268 00:29:21,315 --> 00:29:25,482 - [Student] [speaks too low to hear] 269 00:29:31,992 --> 00:29:38,068 - Intuitively, the velocity is kind of a weighted sum of your gradients that you've seen over time. 270 00:29:38,068 --> 00:29:41,466 - [Student] [speaks too low to hear] 271 00:29:41,466 --> 00:29:44,027 - With more recent gradients being weighted heavier. 272 00:29:44,027 --> 00:29:49,716 At every time step, we take our old velocity, we decay by friction and we add in our current gradient. 273 00:29:49,716 --> 00:29:54,662 You can kind of think of this as a smooth moving average of your recent gradients 274 00:29:54,662 --> 00:30:00,109 with kind of an exponentially decaying weight on your gradients going back in time. 275 00:30:02,627 --> 00:30:11,632 This Nesterov formulation is a little bit annoying 'cause if you look at this, normally when you have your loss function, you want to evaluate your loss and your gradient at the same point. 276 00:30:11,632 --> 00:30:19,283 Nesterov breaks this a little bit. It's a little bit annoying to work with. Thankfully, there's a cute change of variables you can do. 277 00:30:19,283 --> 00:30:29,392 If you do the change of variables and reshuffle a little bit, then you can write Nesterov momentum in a slightly different way that now, again, lets you evaluate the loss and the gradient at the same point always. 278 00:30:29,392 --> 00:30:34,093 Once you make this change of variables, you get kind of a nice interpretation of Nesterov, 279 00:30:34,093 --> 00:30:41,739 which is that here in the first step, this looks exactly like updating the velocity in the vanilla SGD momentum case, where we 280 00:30:41,739 --> 00:30:48,178 have our current velocity, we evaluate the gradient at the current point and mix these two together in a decaying way. 281 00:30:48,178 --> 00:30:51,951 Now in the second update, now when we're actually updating our parameter vector, if you look 282 00:30:51,951 --> 00:30:57,592 at the second equation, we have our current point plus our current velocity plus 283 00:30:57,592 --> 00:31:01,454 a weighted difference between our current velocity and our previous velocity. 284 00:31:01,454 --> 00:31:11,271 Here, Nesterov momentum is kind of incorporating some kind of error-correcting term between your current velocity and your previous velocity. 285 00:31:13,029 --> 00:31:25,249 If we look at SGD, SGD momentum and Nesterov momentum on this kind of simple problem, we notice that SGD, in black, kind of takes this slow progress toward the minima. 286 00:31:26,346 --> 00:31:29,598 The blue and the green show momentum and Nesterov. 287 00:31:29,598 --> 00:31:36,803 These have this behavior of kind of overshooting the minimum 'cause they're building up velocity going past the minimum, and then kind of 288 00:31:36,803 --> 00:31:39,849 correcting themselves and coming back towards the minima. 289 00:31:39,849 --> 00:31:40,682 Question? 290 00:31:42,023 --> 00:31:46,190 - [Student] [speaks too low to hear] 291 00:31:52,024 --> 00:31:58,050 - The question is this picture looks good, but what happens if your minimum lies in this very narrow basin? 292 00:31:58,050 --> 00:32:01,527 Will the velocity just cause you to skip right over that minimum? 293 00:32:01,527 --> 00:32:05,232 That's actually a really interesting point, and the subject of some recent theoretical work, 294 00:32:05,232 --> 00:32:09,071 but the idea is that maybe those really sharp minima are actually bad minima.
295 00:32:09,071 --> 00:32:17,601 We don't want to even land in those 'cause the idea is that maybe if you have a very sharp minima, that actually could be a minima that overfits more. 296 00:32:17,601 --> 00:32:22,026 If you imagine that we doubled our training set, the whole optimization landscape would change, 297 00:32:22,026 --> 00:32:27,420 and maybe that very sensitive minima would actually disappear if we were to collect more training data. 298 00:32:27,420 --> 00:32:31,189 We kind of have this intuition that we maybe want to land in very flat minima 299 00:32:31,189 --> 00:32:35,933 because those very flat minima are probably more robust as we change the training data. 300 00:32:35,933 --> 00:32:40,453 Those flat minima might actually generalize better to testing data. 301 00:32:40,453 --> 00:32:46,284 This is again, sort of very recent theoretical work, but that's actually a really good point to bring up. 302 00:32:46,284 --> 00:32:54,354 In some sense, it's actually a feature and not a bug that SGD momentum actually skips over those very sharp minima. 303 00:32:55,979 --> 00:32:59,979 That's actually a good thing, believe it or not. 304 00:33:00,825 --> 00:33:04,316 Another thing you can see is if you look at the difference between momentum and Nesterov here, 305 00:33:04,316 --> 00:33:12,715 you can see that because of the correction factor in Nesterov, maybe it's not overshooting quite as drastically, compared to vanilla momentum. 306 00:33:14,683 --> 00:33:20,068 Another kind of common optimization strategy is this algorithm called AdaGrad, 307 00:33:20,068 --> 00:33:25,292 which John Duchi, who's now a professor here, worked on during his Ph.D. 308 00:33:25,292 --> 00:33:37,663 The idea with AdaGrad is that during the course of the optimization, you're going to keep a running estimate or a running sum of all the squared gradients that you see during training. 309 00:33:39,569 --> 00:33:43,957 Now rather than having a velocity term, instead we have this grad squared term. 310 00:33:43,957 --> 00:33:49,199 During training, we're going to just keep adding the squared gradients to this grad squared term. 311 00:33:49,199 --> 00:33:57,449 Now when we update our parameter vector, we'll divide by this grad squared term when we're making our update step. 312 00:33:59,334 --> 00:34:07,261 The question is what does this kind of scaling do in this situation where we have a very high condition number? 313 00:34:08,393 --> 00:34:12,560 - [Student] [speaks too low to hear] 314 00:34:16,256 --> 00:34:22,904 - The idea is that if we have two coordinates, one that always has a very high gradient and one that always has a very small gradient, 315 00:34:22,904 --> 00:34:35,181 then as we add the sum of the squares of the small gradient, we're going to be dividing by a small number, so we'll accelerate movement along the slow dimension, along the one dimension. 316 00:34:35,181 --> 00:34:45,924 Then along the other dimension, where the gradients tend to be very large, then we'll be dividing by a large number, so we'll kind of slow down our progress along the wiggling dimension. 317 00:34:45,924 --> 00:34:56,093 But there's kind of a problem here. That's the question of what happens with AdaGrad over the course of training, as t gets larger and larger and larger?
318 00:34:56,094 --> 00:34:58,391 - [Student] [speaks too low to hear] 319 00:34:58,391 --> 00:35:02,239 - With AdaGrad, the steps actually get smaller and smaller and smaller because we just 320 00:35:02,239 --> 00:35:09,895 continue updating this estimate of the squared gradients over time, so this estimate just grows and grows and grows monotonically over the course of training. 321 00:35:09,895 --> 00:35:15,359 Now this causes our step size to get smaller and smaller and smaller over time. 322 00:35:15,359 --> 00:35:20,334 Again, in the convex case, there's some really nice theory showing that this is actually 323 00:35:20,334 --> 00:35:28,125 really good 'cause in the convex case, as you approach a minimum, you kind of want to slow down so you actually converge. 324 00:35:28,125 --> 00:35:31,192 That's actually kind of a feature in the convex case. 325 00:35:31,192 --> 00:35:42,007 But in the non-convex case, that's a little bit problematic because as you come towards a saddle point, you might get stuck with AdaGrad, and then you kind of no longer make any progress. 326 00:35:42,007 --> 00:35:48,678 There's a slight variation of AdaGrad, called RMSProp, that actually addresses this concern a little bit. 327 00:35:48,678 --> 00:35:53,390 Now with RMSProp, we still keep this estimate of the squared gradients, but instead 328 00:35:53,390 --> 00:36:01,085 of just letting that squared estimate continually accumulate over training, instead, we let that squared estimate actually decay. 329 00:36:01,085 --> 00:36:09,340 This ends up looking kind of like a momentum update, except we're having kind of momentum over the squared gradients, rather than momentum over the actual gradients. 330 00:36:09,340 --> 00:36:20,361 Now with RMSProp, after we compute our gradient, we take our current estimate of the grad squared, we multiply it by this decay rate, which is commonly something like .9 or .99. 331 00:36:20,361 --> 00:36:26,601 Then we add in one minus the decay rate times our current squared gradient. 332 00:36:26,601 --> 00:36:37,193 Now over time, you can imagine that this estimate leaks away the older gradients. Then again, when we make our step, the step looks exactly the same as AdaGrad, where we divide by the squared gradient 333 00:36:37,193 --> 00:36:44,070 in the step to again have this nice property of accelerating movement along the one dimension, and slowing down movement along the other dimension. 334 00:36:44,070 --> 00:36:52,411 But now with RMSProp, because these estimates are leaky, then it kind of addresses the problem of maybe always slowing down where you might not want to. 335 00:36:56,455 --> 00:37:04,173 Here again, we're kind of showing our favorite toy problem with SGD in black, SGD momentum in blue and RMSProp in red. 336 00:37:04,173 --> 00:37:12,263 You can see that RMSProp and SGD momentum are both doing much better than SGD, but their qualitative behavior is a little bit different. 337 00:37:12,263 --> 00:37:17,488 With SGD momentum, it kind of overshoots the minimum and comes back, whereas with 338 00:37:17,488 --> 00:37:26,392 RMSProp, it's kind of adjusting its trajectory such that we're making approximately equal progress among all the dimensions. 339 00:37:26,392 --> 00:37:34,412 By the way, you can't actually tell, but this plot is also showing AdaGrad in green with the same learning rate, but it just 340 00:37:34,412 --> 00:37:38,606 gets stuck due to this problem of continually decaying learning rates.
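As a rough sketch, here are the AdaGrad and RMSProp updates side by side, in the same style as the earlier snippets; the helpers are the same hypothetical placeholders, and the 1e-7 in the denominator is just a small constant to avoid dividing by zero.

```python
import numpy as np

def adagrad(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
            learning_rate=1e-2, num_iterations=1000):
    # AdaGrad: the running sum of squared gradients only ever grows,
    # so the effective step size keeps shrinking over training.
    grad_squared = 0
    for _ in range(num_iterations):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        grad_squared = grad_squared + grad * grad
        weights = weights - learning_rate * grad / (np.sqrt(grad_squared) + 1e-7)
    return weights

def rmsprop(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
            learning_rate=1e-2, decay_rate=0.99, num_iterations=1000):
    # RMSProp: same idea, but the squared-gradient estimate is leaky (it decays),
    # so progress doesn't grind to a halt the way it can with AdaGrad.
    grad_squared = 0
    for _ in range(num_iterations):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
        weights = weights - learning_rate * grad / (np.sqrt(grad_squared) + 1e-7)
    return weights
```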
341 00:37:38,606 --> 00:37:45,553 In practice, AdaGrad is maybe not so common for many of these things. That's a little bit of an unfair comparison of AdaGrad. 342 00:37:46,392 --> 00:37:52,558 Probably you need to increase the learning rate with AdaGrad, and then it would end up looking kind of like RMSProp in this case. 343 00:37:52,558 --> 00:37:57,148 But in general, we tend not to use AdaGrad so much when training neural networks. 344 00:37:57,148 --> 00:37:57,981 Question? 345 00:37:57,981 --> 00:37:59,796 - [Student] [speaks too low to hear] 346 00:37:59,796 --> 00:38:04,387 - The answer is yes, this problem is convex, but in this case, 347 00:38:07,146 --> 00:38:12,003 it's a little bit of an unfair comparison because the learning rates are not so comparable among the methods. 348 00:38:12,003 --> 00:38:17,290 I've been a little bit unfair to AdaGrad in this visualization by showing the same learning rate between the different algorithms, 349 00:38:17,290 --> 00:38:23,132 when probably you should have separately tuned the learning rates per algorithm. 350 00:38:27,970 --> 00:38:35,403 We saw in momentum, we had this idea of velocity, where we're building up velocity by adding in the gradients, and then stepping in the direction of the velocity. 351 00:38:35,403 --> 00:38:43,744 We saw with AdaGrad and RMSProp that we had this other idea, of building up an estimate of the squared gradients, and then dividing by the squared gradients. 352 00:38:43,744 --> 00:38:46,658 Then these both seem like good ideas on their own. 353 00:38:46,658 --> 00:38:50,898 Why don't we just stick 'em together and use them both? Maybe that would be even better. 354 00:38:50,898 --> 00:38:56,741 That brings us to this algorithm called Adam, or rather brings us very close to Adam. 355 00:38:56,741 --> 00:39:01,119 We'll see in a couple slides that there's a slight correction we need to make here. 356 00:39:01,119 --> 00:39:06,888 Here with Adam, we maintain an estimate of the first moment and the second moment. 357 00:39:06,888 --> 00:39:14,667 Now in the red, we make this estimate of the first moment as a weighted sum of our gradients. 358 00:39:14,667 --> 00:39:22,741 We have this moving estimate of the second moment, like AdaGrad and like RMSProp, which is a moving estimate of our squared gradients. 359 00:39:22,741 --> 00:39:28,621 Now when we make our update step, we step using both the first moment, which is kind of our 360 00:39:28,621 --> 00:39:37,281 velocity, and also divide by the second moment, or rather the square root of the second moment, which is this squared gradient term. 361 00:39:38,128 --> 00:39:46,269 This idea of Adam ends up looking a little bit like RMSProp plus momentum, or ends up looking like momentum plus squared gradients. 362 00:39:46,269 --> 00:39:51,989 It kind of incorporates the nice properties of both. But there's a little bit of a problem here. 363 00:39:51,989 --> 00:40:06,134 That's the question of what happens at the very first time step? At the very first time step, you can see that at the beginning, we've initialized our second moment with zero. 364 00:40:06,134 --> 00:40:16,803 Now after one update of the second moment, typically this beta two, second moment decay rate, is something like .9 or .99, something very close to one. 365 00:40:18,235 --> 00:40:22,867 After one update, our second moment is still very, very close to zero.
366 00:40:22,867 --> 00:40:32,377 Now when we're making our update step here and we divide by our second moment, now we're dividing by a very small number. We're making a very, very large step at the beginning. 367 00:40:32,377 --> 00:40:37,768 This very, very large step at the beginning is not really due to the geometry of the problem. 368 00:40:37,768 --> 00:40:43,422 It's kind of an artifact of the fact that we initialized our second moment estimate to zero. 369 00:40:43,422 --> 00:40:44,322 Question? 370 00:40:44,322 --> 00:40:48,489 - [Student] [speaks too low to hear] 371 00:40:52,832 --> 00:41:00,365 - That's true. The comment is that if your first moment is also very small, then you're multiplying by small and you're dividing by square root 372 00:41:00,365 --> 00:41:02,906 of small squared, so what's going to happen? 373 00:41:02,906 --> 00:41:05,746 They might cancel each other out, you might be okay. 374 00:41:05,746 --> 00:41:13,632 That's true. Sometimes these cancel each other out and you're okay, but sometimes this ends up taking very large steps right at the beginning. 375 00:41:13,632 --> 00:41:16,245 That can be quite bad. 376 00:41:16,245 --> 00:41:19,533 Maybe you initialize a little bit poorly. You take a very large step. 377 00:41:19,533 --> 00:41:26,145 Now your initialization is completely messed up, and then you're in a very bad part of the objective landscape and you just can't converge from there. 378 00:41:26,145 --> 00:41:27,165 Question? 379 00:41:27,165 --> 00:41:30,915 - [Student] [speaks too low to hear] 380 00:41:30,915 --> 00:41:37,847 - The question is what is this 10 to the minus seven term in the last equation? That's actually appeared in AdaGrad, RMSProp and Adam. 381 00:41:37,847 --> 00:41:42,187 The idea is that we're dividing by something. We want to make sure we're not dividing by zero, 382 00:41:42,187 --> 00:41:48,609 so we always add a small positive constant to the denominator, just to make sure we're not dividing by zero. 383 00:41:48,609 --> 00:41:56,012 That's technically a hyperparameter, but it tends not to matter too much, so just setting it to 10 to the minus seven, 10 to the minus eight, something like that, tends to work well. 384 00:41:57,967 --> 00:42:05,511 With Adam, remember we just talked about this idea that at the first couple of steps, we might take very large steps and mess ourselves up. 385 00:42:05,511 --> 00:42:12,510 Adam also adds this bias correction term to avoid this problem of taking very large steps at the beginning. 386 00:42:12,510 --> 00:42:22,619 You can see that after we update our first and second moments, we create an unbiased estimate of those first and second moments by incorporating the current time step, t. 387 00:42:22,619 --> 00:42:29,550 Now we actually make our step using these unbiased estimates, rather than the original first and second moment estimates. 388 00:42:29,550 --> 00:42:33,167 This gives us our full form of Adam. 389 00:42:33,167 --> 00:42:45,550 By the way, Adam is a really, [laughs] really good optimization algorithm, and it works really well for a lot of different problems, so that's kind of my default optimization algorithm for just about any new problem that I'm tackling. 390 00:42:45,550 --> 00:42:53,088 In particular, if you set beta one equals .9, beta two equals .999, learning rate one e minus three or five e minus four, 391 00:42:53,088 --> 00:42:58,797 that's a great starting point for just about all the architectures I've ever worked with.
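Putting the pieces together, here is a minimal sketch of the full Adam update with the bias correction, using those default hyperparameters; as before, the data and gradient helpers are hypothetical placeholders, not a real library API.

```python
import numpy as np

def adam(weights, training_data, loss_fn, sample_minibatch, evaluate_gradient,
         learning_rate=1e-3, beta1=0.9, beta2=0.999, num_iterations=1000):
    # Adam: momentum-like first moment, RMSProp-like second moment, plus the
    # bias correction that avoids huge steps in the first few iterations.
    first_moment, second_moment = 0, 0
    for t in range(1, num_iterations + 1):
        grad = evaluate_gradient(loss_fn, sample_minibatch(training_data, batch_size=256), weights)
        first_moment = beta1 * first_moment + (1 - beta1) * grad
        second_moment = beta2 * second_moment + (1 - beta2) * grad * grad
        first_unbias = first_moment / (1 - beta1 ** t)     # bias correction
        second_unbias = second_moment / (1 - beta2 ** t)   # bias correction
        weights = weights - learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    return weights
```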
392 00:42:58,797 --> 00:43:03,518 Try that. That's a really good place to start, in general. 393 00:43:03,518 --> 00:43:05,949 [laughs] 394 00:43:05,949 --> 00:43:11,634 If we actually plot these things out and look at SGD, SGD momentum, RMSProp and Adam on the same problem, 395 00:43:11,634 --> 00:43:18,094 you can see that Adam, in the purple here, kind of combines elements of both SGD momentum and RMSProp. 396 00:43:18,094 --> 00:43:25,175 Adam kind of overshoots the minimum a little bit like SGD momentum, but it doesn't overshoot quite as much as momentum. 397 00:43:25,175 --> 00:43:33,268 Adam also has this similar behavior of RMSProp of kind of trying to curve to make equal progress along all dimensions. 398 00:43:33,268 --> 00:43:37,706 Maybe in this small two-dimensional example, Adam converged about the same as the other ones, 399 00:43:37,706 --> 00:43:42,832 but you can see qualitatively that it's kind of combining the behaviors of both momentum and RMSProp. 400 00:43:45,042 --> 00:43:48,709 Any questions about optimization algorithms? 401 00:43:50,048 --> 00:43:56,606 - [Student] [speaks too low to hear] They still take a very long time to train. [speaks too low to hear] 402 00:43:56,606 --> 00:44:03,193 - The question is what does Adam not fix? Well, these neural networks are still large, and they still take a long time to train. 403 00:44:04,744 --> 00:44:07,098 There can still be a problem. 404 00:44:07,098 --> 00:44:11,979 In this picture where we have this landscape of things looking like ovals, you can imagine 405 00:44:11,979 --> 00:44:19,219 that we're kind of making estimates along each dimension independently to allow us to speed up or slow down along different 406 00:44:19,219 --> 00:44:26,576 coordinate axes, but one problem is that if that taco shell is kind of tilted and is not axis aligned, then we're still 407 00:44:26,576 --> 00:44:29,887 only making estimates along the individual axes independently. 408 00:44:30,935 --> 00:44:38,131 That corresponds to taking your rotated taco shell and squishing it horizontally and vertically, but you can't actually unrotate it. 409 00:44:38,131 --> 00:44:48,732 In cases where you have this kind of rotated picture of poor conditioning, then Adam or any of these other algorithms really can't address that concern. 410 00:44:51,356 --> 00:44:57,706 Another thing that we've seen in all these optimization algorithms is learning rate as a hyperparameter. 411 00:44:57,706 --> 00:45:01,828 We've seen this picture before a couple times, that as you use different learning rates, 412 00:45:01,828 --> 00:45:05,097 sometimes if it's too high, it might explode in the yellow. 413 00:45:05,097 --> 00:45:09,629 If it's a very low learning rate, in the blue, it might take a very long time to converge. 414 00:45:09,629 --> 00:45:11,933 It's kind of tricky to pick the right learning rate. 415 00:45:13,712 --> 00:45:19,308 This is a little bit of a trick question because we don't actually have to stick with one learning rate throughout the course of training. 416 00:45:19,308 --> 00:45:29,705 Sometimes you'll see people decay the learning rates over time, where we can kind of combine the effects of these different curves on the left, and get the nice properties of each. 417 00:45:29,705 --> 00:45:39,366 Sometimes you'll start with a higher learning rate near the start of training, and then decay the learning rate and make it smaller and smaller throughout the course of training.
418 00:45:39,366 --> 00:45:46,795 A couple of strategies for this would be a step decay, where at the 100,000th iteration, you just decay by some factor and you keep going. 419 00:45:46,795 --> 00:45:52,579 You might see an exponential decay, where you continually decay during training. 420 00:45:52,579 --> 00:45:57,598 You might see different variations of continually decaying the learning rate during training. 421 00:45:57,598 --> 00:46:04,347 If you look at papers, especially the ResNet paper, you often see plots that look kind of like this, where the loss is kind of going down, 422 00:46:04,347 --> 00:46:07,898 then dropping, then flattening again, then dropping again. 423 00:46:07,898 --> 00:46:11,312 What's going on in these plots is that they're using a step decay learning rate, 424 00:46:11,312 --> 00:46:18,401 where at these parts where it plateaus and then suddenly drops again, those are the iterations where they dropped the learning rate by some factor. 425 00:46:18,401 --> 00:46:26,243 The idea of dropping the learning rate is that you might imagine the model got near some good region, but now the gradients are getting smaller, and 426 00:46:26,243 --> 00:46:28,066 it's kind of bouncing around too much. 427 00:46:28,066 --> 00:46:32,745 Then if we drop the learning rate, it lets it slow down and continue to make progress down the landscape. 428 00:46:32,745 --> 00:46:36,475 This tends to help in practice sometimes. 429 00:46:36,475 --> 00:46:44,973 Although one thing to point out is that learning rate decay is a little bit more common with SGD momentum, and a little bit less common with something like Adam. 430 00:46:44,973 --> 00:46:50,458 Another thing I'd like to point out is that learning rate decay is kind of a second-order hyperparameter. 431 00:46:50,458 --> 00:46:53,324 You typically should not optimize over this thing from the start. 432 00:46:53,324 --> 00:47:00,877 Usually when you're kind of getting networks to work at the beginning, you want to pick a good learning rate with no learning rate decay from the start. 433 00:47:00,877 --> 00:47:06,068 Trying to cross-validate jointly over learning rate decay and initial learning rate and other things, you'll just get confused. 434 00:47:06,068 --> 00:47:10,581 What you do for setting learning rate decay is try with no decay, see what happens. 435 00:47:10,581 --> 00:47:15,427 Then kind of eyeball the loss curve and see where you think you might need decay. 436 00:47:16,860 --> 00:47:24,948 Another thing I wanted to mention briefly is this idea that all these algorithms we've talked about are first-order optimization algorithms. 437 00:47:24,948 --> 00:47:33,064 In this one-dimensional picture, we have this kind of curvy objective function, and our current point is in red. 438 00:47:33,064 --> 00:47:36,057 What we're basically doing is computing the gradient at that point. 439 00:47:36,057 --> 00:47:40,722 We're using the gradient information to compute some linear approximation to our function, 440 00:47:40,722 --> 00:47:44,208 which is kind of a first-order Taylor approximation to our function. 441 00:47:44,208 --> 00:47:51,814 Now we pretend that the first-order approximation is our actual function, and we make a step to try to minimize the approximation. 442 00:47:51,814 --> 00:47:57,353 But this approximation doesn't hold for very large regions, so we can't step too far in that direction.
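[Before going further, here is a minimal sketch of the two learning rate decay schedules mentioned a moment ago. The specific constants, like decaying by 10x every 100,000 iterations or the exponential rate k, are placeholder choices for illustration, not recommendations from the lecture.]

import numpy as np

def step_decay(lr0, t, drop=0.1, every=100000):
    # multiply the initial learning rate by `drop` after every `every` iterations
    return lr0 * (drop ** (t // every))

def exponential_decay(lr0, t, k=1e-5):
    # continuously shrink the learning rate: lr = lr0 * exp(-k * t)
    return lr0 * np.exp(-k * t)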
443 00:47:57,353 --> 00:48:04,509 But really, the idea here is that we're only incorporating information about the first derivative of the function. You can actually go a little bit fancier. 444 00:48:04,509 --> 00:48:11,248 There's this idea of second-order approximation, where we take into account both first derivative and second derivative information. 445 00:48:11,248 --> 00:48:18,449 Now we make a second-order Taylor approximation to our function and kind of locally approximate our function with a quadratic. 446 00:48:18,449 --> 00:48:22,281 Now with a quadratic, you can step right to the minimum, and you're really happy. 447 00:48:22,281 --> 00:48:25,769 That's this idea of second-order optimization. 448 00:48:25,769 --> 00:48:30,489 When you generalize this to multiple dimensions, you get something called the Newton step, 449 00:48:30,489 --> 00:48:35,066 where you compute this Hessian matrix, which is a matrix of second derivatives, 450 00:48:35,066 --> 00:48:43,689 and you end up inverting this Hessian matrix in order to step directly to the minimum of this quadratic approximation to your function. 451 00:48:43,689 --> 00:48:48,910 Does anyone spot something that's quite different about this update rule, compared to the other ones that we've seen? 452 00:48:48,910 --> 00:48:51,107 - [Student] [speaks too low to hear] 453 00:48:51,107 --> 00:48:54,328 - This doesn't have a learning rate. That's kind of cool. 454 00:48:56,463 --> 00:49:00,664 We're making this quadratic approximation and we're stepping right to the minimum of the quadratic. 455 00:49:00,664 --> 00:49:04,681 At least in this vanilla version of Newton's method, you don't actually need a learning rate. 456 00:49:04,681 --> 00:49:07,849 You just always step to the minimum at every time step. 457 00:49:07,849 --> 00:49:13,265 However, in practice, you might end up, have a learning rate anyway because, again, that quadratic approximation might not be perfect, 458 00:49:13,265 --> 00:49:21,055 so you might only want to step in the direction towards the minimum, rather than actually stepping to the minimum, but at least in this vanilla version, it doesn't have a learning rate. 459 00:49:23,994 --> 00:49:27,366 But unfortunately, this is maybe a little bit impractical for deep learning 460 00:49:27,366 --> 00:49:34,519 because this Hessian matrix is N by N, where N is the number of parameters in your network. 461 00:49:34,519 --> 00:49:38,498 If N is 100 million, then 100 million squared is way too big. 462 00:49:38,498 --> 00:49:42,046 You definitely can't store that in memory, and you definitely can't invert it. 463 00:49:42,046 --> 00:49:46,486 In practice, people sometimes use these quasi-Newton methods that, rather than working 464 00:49:46,486 --> 00:49:52,725 with the full Hessian and inverting the full Hessian, they work with approximations. Low-rank approximations are common. 465 00:49:52,725 --> 00:49:57,092 You'll sometimes see these for some problems. 466 00:49:57,092 --> 00:50:03,487 L-BFGS is one particular second-order optimizer that has this approximate second, keeps this approximation of the Hessian 467 00:50:03,487 --> 00:50:11,205 that you'll sometimes see, but in practice, it doesn't work too well for many deep learning problems because these approximations, 468 00:50:11,205 --> 00:50:16,410 these second-order approximations, don't really handle the stochastic case very much, very nicely. 469 00:50:16,410 --> 00:50:20,616 They also tend not to work so well with non-convex problems. 
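[For reference, a tiny sketch of the vanilla Newton update described above. Note that there is no learning rate, and that for a network with N parameters the Hessian below would be N by N, which is exactly what makes this impractical for deep learning; grad_fn and hessian_fn are stand-ins for functions that return the gradient and Hessian.]

import numpy as np

def newton_step(x, grad_fn, hessian_fn):
    g = grad_fn(x)       # gradient, shape (N,)
    H = hessian_fn(x)    # Hessian, shape (N, N) -- already far too big to form for a deep net
    # step straight to the minimum of the local quadratic approximation:
    # x_new = x - H^{-1} g  (solve a linear system rather than explicitly inverting H)
    return x - np.linalg.solve(H, g)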
470 00:50:20,616 --> 00:50:23,142 I don't want to get into that right now too much. 471 00:50:23,142 --> 00:50:29,022 In practice, what you should really do is probably Adam is a really good choice for many different neural network things, 472 00:50:29,022 --> 00:50:38,974 but if you're in a situation where you can afford to do full batch updates, and you know that your problem doesn't have really any stochasticity, then L-BFGS is kind of a good choice. 473 00:50:38,974 --> 00:50:43,181 L-BFGS doesn't really get used for training neural networks too much, but as we'll see 474 00:50:43,181 --> 00:50:47,251 in a couple of lectures, it does sometimes get used for things like style transfer, 475 00:50:47,251 --> 00:50:54,356 where you actually have less stochasticity and fewer parameters, but you still want to solve an optimization problem. 476 00:50:55,834 --> 00:51:00,992 All of these strategies we've talked about so far are about reducing training error. 477 00:51:02,344 --> 00:51:07,452 All these optimization algorithms are really about driving down your training error and minimizing your objective function, 478 00:51:07,452 --> 00:51:10,403 but we don't really care about training error that much. 479 00:51:10,403 --> 00:51:13,203 Instead, we really care about our performance on unseen data. 480 00:51:13,203 --> 00:51:16,817 We really care about reducing this gap between train and test error. 481 00:51:16,817 --> 00:51:21,228 The question is once we're already good at optimizing our objective function, 482 00:51:21,228 --> 00:51:25,535 what can we do to try to reduce this gap and make our model perform better on unseen data? 483 00:51:28,497 --> 00:51:33,617 One really quick and dirty, easy thing to try is this idea of model ensembles 484 00:51:33,617 --> 00:51:36,767 that sometimes works across many different areas in machine learning. 485 00:51:36,767 --> 00:51:44,588 The idea is pretty simple. Rather than having just one model, we'll train 10 different models independently from different initial random restarts. 486 00:51:44,588 --> 00:51:51,333 Now at test time, we'll run our data through all of the 10 models and average the predictions of those 10 models. 487 00:51:53,562 --> 00:52:01,555 Adding these multiple models together tends to reduce overfitting a little bit and tend to improve performance a little bit, typically by a couple percent. 488 00:52:01,555 --> 00:52:05,302 This is generally not a drastic improvement, but it is a consistent improvement. 489 00:52:05,302 --> 00:52:13,263 You'll see that in competitions, like ImageNet and other things like that, using model ensembles is very common to get maximal performance. 490 00:52:14,488 --> 00:52:20,482 You can actually get a little bit creative with this. Sometimes rather than training separate models independently, you can just keep multiple 491 00:52:20,482 --> 00:52:25,928 snapshots of your model during the course of training, and then use these as your ensembles. 492 00:52:25,928 --> 00:52:29,804 Then you still, at test time, need to average the predictions of these multiple snapshots, 493 00:52:29,804 --> 00:52:33,244 but you can collect the snapshots during the course of training. 494 00:52:34,133 --> 00:52:43,210 There's actually a very nice paper being presented at ICLR this week that kind of has a fancy version of this idea, where we use a crazy learning rate schedule, 495 00:52:43,210 --> 00:52:47,996 where our learning rate goes very slow, then very fast, then very slow, then very fast. 
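[Backing up for a moment to the basic ensemble idea: a minimal sketch of averaging the predictions of several independently trained models at test time might look like the following. predict_probs is a hypothetical method returning per-class probabilities; the snapshot variant being described here just swaps the independently trained models for checkpoints saved during a single training run.]

import numpy as np

def ensemble_predict(models, x):
    # average the class probabilities of several trained models, then pick the best class
    probs = np.mean([m.predict_probs(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)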
496 00:52:47,996 --> 00:52:57,631 The idea is that with this crazy learning rate schedule, then over the course of training, the model might be able to converge to different regions in the objective landscape that all are reasonably good. 497 00:52:58,717 --> 00:53:05,532 If you do an ensemble over these different snapshots, then you can improve your performance quite nicely, even though you're only training the model once. 498 00:53:05,532 --> 00:53:11,198 Questions? - [Student] [speaks too low to hear] 499 00:53:25,388 --> 00:53:33,413 - The question is, it's bad when there's a large gap between error 'cause that means you're overfitting, but if there's no gap, then is that also maybe bad? 500 00:53:33,413 --> 00:53:37,446 Do we actually want some small, optimal gap between the two? 501 00:53:37,446 --> 00:53:39,132 We don't really care about the gap. 502 00:53:39,132 --> 00:53:44,019 What we really care about is maximizing the performance on the validation set. 503 00:53:44,019 --> 00:53:54,995 What tends to happen is that if you don't see a gap, then you could have improved your absolute performance, in many cases, by overfitting a little bit more. 504 00:53:54,995 --> 00:54:02,720 There's this weird correlation between the absolute performance on the validation set and the size of that gap. We only care about absolute performance. 505 00:54:02,720 --> 00:54:03,735 Question in the back? 506 00:54:03,735 --> 00:54:07,004 - [Student] Are hyperparameters the same for the ensemble? 507 00:54:07,004 --> 00:54:09,528 - Are the hyperparameters the same for the ensembles? 508 00:54:09,528 --> 00:54:12,234 That's a good question. Sometimes they're not. 509 00:54:12,234 --> 00:54:19,614 You might want to try different sizes of the model, different learning rates, different regularization strategies and ensemble across these different things. 510 00:54:19,614 --> 00:54:22,614 That actually does happen sometimes. 511 00:54:23,496 --> 00:54:31,769 Another little trick you can do sometimes is that during training, you might actually keep an exponentially decaying average of your parameter vector itself to kind of have 512 00:54:31,769 --> 00:54:35,778 a smooth ensemble of your own network during training. 513 00:54:35,778 --> 00:54:41,649 Then use this smoothly decaying average of your parameter vector, rather than the actual checkpoints themselves. 514 00:54:41,649 --> 00:54:45,262 This is called Polyak averaging, and it sometimes helps a little bit. 515 00:54:45,262 --> 00:54:50,838 It's just another one of these small tricks you can sometimes add, but it's not maybe too common in practice. 516 00:54:50,838 --> 00:54:55,778 Another question you might have is that how can we actually improve the performance of single models? 517 00:54:57,229 --> 00:55:02,503 When we have ensembles, we still need to run, like, 10 models at test time. That's not so great. 518 00:55:02,503 --> 00:55:06,219 We really want some strategies to improve the performance of our single models. 519 00:55:06,219 --> 00:55:08,237 That's really this idea of regularization, 520 00:55:08,237 --> 00:55:11,954 where we add something to our model to prevent it from fitting the training data 521 00:55:11,954 --> 00:55:16,203 too well in the attempts to make it perform better on unseen data. 522 00:55:16,203 --> 00:55:23,515 We've seen a couple ideas, a couple methods for regularization already, where we add some explicit extra term to the loss. 
523 00:55:23,515 --> 00:55:29,738 Where we have this one term telling the model to fit the data, and another term that's a regularization term. 524 00:55:29,738 --> 00:55:33,032 You saw this in homework one, where we used L2 regularization. 525 00:55:34,804 --> 00:55:43,001 As we talked about in lecture a couple lectures ago, this L2 regularization doesn't really make maybe a lot of sense in the context of neural networks. 526 00:55:43,922 --> 00:55:47,982 Sometimes we use other things for neural networks. 527 00:55:47,982 --> 00:55:53,376 One regularization strategy that's super, super common for neural networks is this idea of dropout. 528 00:55:53,376 --> 00:55:55,080 Dropout is super simple. 529 00:55:55,080 --> 00:56:02,264 Every time we do a forward pass through the network, at every layer, we're going to randomly set some neurons to zero. 530 00:56:02,264 --> 00:56:08,688 Every time we do a forward pass, we'll set a different random subset of the neurons to zero. This kind of proceeds one layer at a time. 531 00:56:08,688 --> 00:56:15,193 We run through one layer, we compute the value of the layer, we randomly set some of them to zero, and then we continue up through the network. 532 00:56:15,193 --> 00:56:22,445 Now if you look at this fully connected network on the left versus a dropout version of the same network on the right, you can see 533 00:56:22,445 --> 00:56:30,400 that after we do dropout, it kind of looks like a smaller version of the same network, where we're only using some subset of the neurons. 534 00:56:30,400 --> 00:56:35,746 This subset that we use varies at each iteration, at each forward pass. 535 00:56:35,746 --> 00:56:36,732 Question? 536 00:56:36,732 --> 00:56:40,899 - [Student] [speaks too low to hear] 537 00:56:43,694 --> 00:56:46,375 - The question is what are we setting to zero? It's the activations. 538 00:56:46,375 --> 00:56:51,731 Each layer is computing previous activation times the weight matrix gives you our next activation. 539 00:56:51,731 --> 00:57:01,592 Then you just take that activation, set some of them to zero, and then your next layer will be partially zeroed activations times another matrix give you your next activations. 540 00:57:01,592 --> 00:57:03,155 Question? 541 00:57:03,155 --> 00:57:06,702 - [Student] [speaks too low to hear] 542 00:57:06,702 --> 00:57:08,751 - Question is which layers do you do this on? 543 00:57:08,751 --> 00:57:14,454 It's more common in fully connected layers, but you sometimes see this in convolutional layers, as well. 544 00:57:14,454 --> 00:57:23,423 When you're working in convolutional layers, sometimes instead of dropping each activation randomly, instead you sometimes might drop entire feature maps randomly. 545 00:57:24,455 --> 00:57:30,117 In convolutions, you have this channel dimension, and you might drop out entire channels, rather than random elements. 546 00:57:32,059 --> 00:57:38,480 Dropout is kind of super simple in practice. It only requires adding two lines, one line per dropout call. 547 00:57:38,480 --> 00:57:41,572 Here we have a three-layer neural network, and we've added dropout. 548 00:57:41,572 --> 00:57:49,460 You can see that all we needed to do was add this extra line where we randomly set some things to zero. This is super easy to implement. 549 00:57:49,460 --> 00:57:52,138 But the question is why is this even a good idea? 550 00:57:52,138 --> 00:57:58,067 We're seriously messing with the network at training time by setting a bunch of its values to zero. 
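[For concreteness, a minimal sketch of that two-extra-lines implementation for a three-layer network, where p is the probability of keeping a unit active. Only the forward pass at training time is shown; this is an illustration rather than the exact code on the slide.]

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, W2, W3):
    H1 = np.maximum(0, X.dot(W1))
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # zero out some activations
    H2 = np.maximum(0, H1.dot(W2))
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # zero out some activations
    return H2.dot(W3)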
551 00:57:58,067 --> 00:58:00,988 How can this possibly make sense? 552 00:58:00,988 --> 00:58:08,532 One sort of slightly hand-wavy idea that people have is that dropout helps prevent co-adaptation of features. 553 00:58:09,622 --> 00:58:15,853 Maybe if you imagine that we're trying to classify cats, maybe in some universe, the network might learn one neuron 554 00:58:15,853 --> 00:58:21,066 for having an ear, one neuron for having a tail, one neuron for the input being furry. 555 00:58:21,066 --> 00:58:24,751 Then it kind of combines these things together to decide whether or not it's a cat. 556 00:58:24,751 --> 00:58:32,831 But now if we have dropout, then in making the final decision about catness, the network cannot depend too much on any one of these features. 557 00:58:32,831 --> 00:58:37,725 Instead, it kind of needs to distribute its idea of catness across many different features. 558 00:58:37,725 --> 00:58:42,205 This might help prevent overfitting somehow. 559 00:58:42,205 --> 00:58:50,347 Another interpretation of dropout that's come out a little bit more recently is that it's kind of like doing model ensembling within a single model. 560 00:58:51,690 --> 00:58:58,745 If you look at the picture on the left, after you apply dropout to the network, we're kind of computing this subnetwork using some subset of the neurons. 561 00:58:58,745 --> 00:59:03,391 Now every different potential dropout mask leads to a different potential subnetwork. 562 00:59:03,391 --> 00:59:09,145 Now dropout is kind of learning a whole ensemble of networks all at the same time that all share parameters. 563 00:59:09,145 --> 00:59:13,790 By the way, because the number of potential dropout masks grows exponentially in the number 564 00:59:13,790 --> 00:59:17,152 of neurons, you're never going to sample all of these things. 565 00:59:18,089 --> 00:59:24,788 This is really a gigantic, gigantic ensemble of networks that are all being trained simultaneously. 566 00:59:25,622 --> 00:59:29,128 Then the question is what happens at test time? 567 00:59:29,128 --> 00:59:34,158 Once we move to dropout, we've kind of fundamentally changed the operation of our neural network. 568 00:59:34,158 --> 00:59:42,850 Previously, we've had our neural network, f, be a function of the weights, w, and the inputs, x, and then produce the output, y. 569 00:59:42,850 --> 00:59:48,268 But now, our network is also taking this additional input, z, which is some random dropout mask. 570 00:59:48,268 --> 00:59:52,732 That z is random. Having randomness at test time is maybe bad. 571 00:59:52,732 --> 00:59:57,444 Imagine that you're working at Facebook, and you want to classify the images that people are uploading. 572 00:59:57,444 --> 01:00:03,092 Then today, your image gets classified as a cat, and tomorrow it doesn't. That would be really weird and really bad. 573 01:00:03,092 --> 01:00:09,323 You'd probably want to eliminate this stochasticity at test time once the network is already trained. 574 01:00:09,323 --> 01:00:12,093 Then we kind of want to average out this randomness. 575 01:00:12,093 --> 01:00:18,131 If you write this out, you can imagine actually marginalizing out this randomness with some integral, but in practice, 576 01:00:18,131 --> 01:00:24,368 this integral is totally intractable. We don't know how to evaluate this thing. You're in bad shape.
577 01:00:24,368 --> 01:00:28,073 One thing you might imagine doing is approximating this integral via sampling, 578 01:00:28,073 --> 01:00:31,484 where you draw multiple samples of z and then average them out at test time, 579 01:00:31,484 --> 01:00:36,040 but this still would introduce some randomness, which is a little bit bad. 580 01:00:36,040 --> 01:00:41,423 Thankfully, in the case of dropout, we can actually approximate this integral in kind of a cheap way locally. 581 01:00:41,423 --> 01:00:47,228 If we consider a single neuron, the output is a, the inputs are x and y, with two weights, w one and w two. 582 01:00:47,228 --> 01:00:52,622 Then at test time, our value a is just w one x plus w two y. 583 01:00:53,590 --> 01:01:00,645 Now imagine that we train this network. During training, we used dropout with probability 1/2 of dropping our neurons. 584 01:01:00,645 --> 01:01:06,317 Now the expected value of a during training, we can kind of compute analytically for this small case. 585 01:01:07,712 --> 01:01:12,249 There are four possible dropout masks, and we're going to average out the values across these four masks. 586 01:01:12,249 --> 01:01:18,204 We can see that the expected value of a during training is 1/2 times (w one x plus w two y). 587 01:01:19,075 --> 01:01:29,000 There's this disconnect between the average value of w one x plus w two y at test time, and the average value at training time, which is only 1/2 as much. 588 01:01:29,000 --> 01:01:34,883 One cheap thing we can do is that at test time, we don't have any stochasticity. 589 01:01:34,883 --> 01:01:40,736 Instead, we just multiply this output by the dropout probability. Now these expected values are the same. 590 01:01:40,736 --> 01:01:44,733 This is kind of like a local cheap approximation to this complex integral. 591 01:01:44,733 --> 01:01:48,576 This is what people really commonly do in practice with dropout. 592 01:01:49,715 --> 01:01:56,269 With dropout, we have this predict function, and we just multiply the outputs of the layer by the dropout probability. 593 01:01:56,269 --> 01:01:59,393 The summary of dropout is that it's really simple on the forward pass. 594 01:01:59,393 --> 01:02:03,807 You're just adding two lines to your implementation to randomly zero out some nodes. 595 01:02:03,807 --> 01:02:10,209 Then at the test time prediction function, you just add one little multiplication by your dropout probability. 596 01:02:10,209 --> 01:02:16,613 Dropout is super simple. It tends to work well sometimes for regularizing neural networks. 597 01:02:16,613 --> 01:02:21,454 By the way, one common trick you see sometimes is this idea of inverted dropout. 598 01:02:22,665 --> 01:02:28,735 Maybe at test time, you care more about efficiency, so you want to eliminate that extra multiplication by p at test time. 599 01:02:28,735 --> 01:02:37,677 Then what you can do is, at test time, use the entire weight matrix unchanged, and at training time, instead divide by p, because training is probably happening on a GPU anyway. 600 01:02:37,677 --> 01:02:44,733 You don't really care if you do one extra multiply at training time, but then at test time, you kind of want this thing to be as efficient as possible. 601 01:02:44,733 --> 01:02:45,566 Question? 602 01:02:46,416 --> 01:02:56,777 - [Student] [speaks too low to hear] Now the gradient [speaks too low to hear]. 603 01:02:57,678 --> 01:03:02,212 - The question is what happens to the gradient during training with dropout? 604 01:03:02,212 --> 01:03:06,583 You're right.
We only end up propagating the gradients through the nodes that were not dropped. 605 01:03:06,583 --> 01:03:15,356 This has the consequence that when you're training with dropout, typically training takes longer because at each step, you're only updating some subparts of the network. 606 01:03:15,356 --> 01:03:22,287 When you're using dropout, it typically takes longer to train, but you might have a better generalization after it's converged. 607 01:03:24,409 --> 01:03:32,810 Dropout, we kind of saw is like this one concrete instantiation. There's a little bit more general strategy for regularization where during training 608 01:03:32,810 --> 01:03:37,482 we add some kind of randomness to the network to prevent it from fitting the training data too well. 609 01:03:37,482 --> 01:03:41,037 To kind of mess it up and prevent it from fitting the training data perfectly. 610 01:03:41,037 --> 01:03:46,160 Now at test time, we want to average out all that randomness to hopefully improve our generalization. 611 01:03:46,160 --> 01:03:53,927 Dropout is probably the most common example of this type of strategy, but actually batch normalization kind of fits this idea, as well. 612 01:03:53,927 --> 01:04:00,755 Remember in batch normalization, during training, one data point might appear in different mini batches with different other data points. 613 01:04:00,755 --> 01:04:07,200 There's a bit of stochasticity with respect to a single data point with how exactly that point gets normalized during training. 614 01:04:07,200 --> 01:04:14,735 But now at test time, we kind of average out this stochasticity by using some global estimates to normalize, rather than the per mini batch estimates. 615 01:04:14,735 --> 01:04:20,223 Actually batch normalization tends to have kind of a similar regularizing effect as dropout because they both introduce 616 01:04:20,223 --> 01:04:25,478 some kind of stochasticity or noise at training time, but then average it out at test time. 617 01:04:25,478 --> 01:04:35,744 Actually, when you train networks with batch normalization, sometimes you don't use dropout at all, and just the batch normalization adds enough of a regularizing effect to your network. 618 01:04:35,744 --> 01:04:43,833 Dropout is somewhat nice because you can actually tune the regularization strength by varying that parameter p, and there's no such control in batch normalization. 619 01:04:43,833 --> 01:04:48,928 Another kind of strategy that fits in this paradigm is this idea of data augmentation. 620 01:04:48,928 --> 01:04:57,078 During training, in a vanilla version for training, we have our data, we have our label. We use it to update our CNN at each time step. 621 01:04:57,078 --> 01:05:03,555 But instead, what we can do is randomly transform the image in some way during training such that the label is preserved. 622 01:05:03,555 --> 01:05:09,418 Now we train on these random transformations of the image rather than the original images. 623 01:05:09,418 --> 01:05:16,153 Sometimes you might see random horizontal flips 'cause if you take a cat and flip it horizontally, it's still a cat. 624 01:05:17,690 --> 01:05:23,763 You'll randomly sample crops of different sizes from the image because the random crop of the cat is still a cat. 625 01:05:25,188 --> 01:05:30,317 Then during testing, you kind of average out this stochasticity by evaluating with some 626 01:05:30,317 --> 01:05:34,309 fixed set of crops, often the four corners and the middle and their flips. 
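[For reference, a minimal sketch of the train-time flip-and-crop augmentation just described. The image is assumed to be an H x W x C array, and crop_size is a stand-in for whatever input size the network expects.]

import numpy as np

def augment(image, crop_size):
    # a flipped or cropped cat is still a cat, so the label is preserved
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]                     # random horizontal flip
    H, W, _ = image.shape
    y = np.random.randint(0, H - crop_size + 1)       # random crop location
    x = np.random.randint(0, W - crop_size + 1)
    return image[y:y + crop_size, x:x + crop_size, :]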
627 01:05:34,309 --> 01:05:38,041 What's very common is that when you read, for example, papers on ImageNet, they'll report 628 01:05:38,041 --> 01:05:47,308 a single crop performance of their model, which is just like the whole image, and a 10 crop performance of their model, which are these five standard crops plus their flips. 629 01:05:48,238 --> 01:05:56,345 Also with data augmentation, you'll sometimes use color jittering, where you might randomly vary the contrast or brightness of your image during training. 630 01:05:56,345 --> 01:06:04,642 You can get a little bit more complex with color jittering, as well, where you try to make color jitters that are maybe in the PCA directions of your data space or whatever, 631 01:06:04,642 --> 01:06:11,456 where you do some color jittering in some data-dependent way, but that's a little bit less common. 632 01:06:12,492 --> 01:06:18,037 In general, data augmentation is this really general thing that you can apply to just about any problem. 633 01:06:18,037 --> 01:06:24,940 Whatever problem you're trying to solve, you kind of think about what are the ways that I can transform my data without changing the label? 634 01:06:24,940 --> 01:06:31,218 Now during training, you just apply these random transformations to your input data. This sort of has a regularizing effect 635 01:06:31,218 --> 01:06:38,954 on the network because you're, again, adding some kind of stochasticity during training, and then marginalizing it out at test time. 636 01:06:40,055 --> 01:06:45,232 Now we've seen three examples of this pattern, dropout, batch normalization, data augmentation, 637 01:06:45,232 --> 01:06:47,154 but there's many other examples, as well. 638 01:06:47,154 --> 01:06:53,049 Once you have this pattern in your mind, you'll kind of recognize this thing as you read other papers sometimes. 639 01:06:53,049 --> 01:06:56,722 There's another kind of related idea to dropout called DropConnect. 640 01:06:56,722 --> 01:07:06,265 With DropConnect, it's the same idea, but rather than zeroing out the activations at every forward pass, we instead randomly zero out some of the values of the weight matrix. 641 01:07:06,265 --> 01:07:09,652 Again, it kind of has this similar flavor. 642 01:07:09,652 --> 01:07:16,281 Another kind of cool idea that I like, this one's not so commonly used, but I just think it's a really cool idea, 643 01:07:16,281 --> 01:07:19,400 is this idea of fractional max pooling. 644 01:07:19,400 --> 01:07:29,067 Normally when you do two-by-two max pooling, you have these fixed two-by-two regions over which you pool in the forward pass, but now with fractional max pooling, 645 01:07:29,067 --> 01:07:35,851 every time we have our pooling layer, we're going to randomize exactly the regions over which we pool. 646 01:07:35,851 --> 01:07:43,070 Here in the example on the right, I've shown three different sets of random pooling regions that you might see during training. 647 01:07:43,070 --> 01:07:48,857 Now during test time, you kind of average the stochasticity out 648 01:07:48,857 --> 01:07:54,704 by either sticking to some fixed set of pooling regions, or drawing many samples and averaging over them. 649 01:07:54,704 --> 01:07:59,027 That's kind of a cool idea, even though it's not so commonly used.
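[For reference, a minimal sketch of the DropConnect idea mentioned a moment ago: the random mask is applied to the weight matrix rather than to the activations. Here p is the probability of keeping a weight, and the function only shows one layer's forward pass.]

import numpy as np

def dropconnect_forward(x, W, p=0.5):
    # randomly zero entries of the weight matrix instead of the activations
    mask = np.random.rand(*W.shape) < p
    return x.dot(W * mask)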
650 01:07:59,027 --> 01:08:05,890 Another really kind of surprising paper in this paradigm that actually came out in the last year, so this is new since 651 01:08:05,890 --> 01:08:09,911 the last time we taught the class, is this idea of stochastic depth. 652 01:08:09,911 --> 01:08:15,490 Here we have a network on the left. The idea is that we have a very deep network. 653 01:08:15,490 --> 01:08:18,530 We're going to randomly drop layers from the network during training. 654 01:08:18,530 --> 01:08:24,113 During training, we're going to eliminate some layers and only use some subset of the layers during training. 655 01:08:24,114 --> 01:08:26,854 Now during test time, we'll use the whole network. 656 01:08:26,854 --> 01:08:30,251 This is kind of crazy. It's kind of amazing that this works, 657 01:08:30,251 --> 01:08:35,310 but this tends to have kind of a similar regularizing effect as dropout and these other strategies. 658 01:08:35,310 --> 01:08:42,041 But again, this is super, super cutting-edge research. This is not super commonly used in practice, but it is a cool idea. 659 01:08:44,694 --> 01:08:52,673 Any last minute questions about regularization? No? Use it. It's a good idea. Yeah? 660 01:08:52,673 --> 01:08:57,046 - [Student] [speaks too low to hear] 661 01:08:57,046 --> 01:09:01,184 - The question is do you usually use more than one regularization method? 662 01:09:04,325 --> 01:09:09,751 You should generally be using batch normalization as kind of a good thing to have in most networks nowadays because it 663 01:09:09,752 --> 01:09:12,650 helps you converge, especially for very deep things. 664 01:09:12,650 --> 01:09:25,204 In many cases, batch normalization alone tends to be enough, but then sometimes if batch normalization alone is not enough, then you can consider adding dropout or other thing once you see your network overfitting. 665 01:09:25,204 --> 01:09:28,526 You generally don't do a blind cross-validation over these things. 666 01:09:28,526 --> 01:09:33,942 Instead, you add them in in a targeted way once you see your network is overfitting. 667 01:09:36,400 --> 01:09:38,981 One quick thing, it's this idea of transfer learning. 668 01:09:38,981 --> 01:09:47,018 We've kind of seen with regularization, we can help reduce the gap between train and test error by adding these different regularization strategies. 669 01:09:48,903 --> 01:09:53,012 One problem with overfitting is sometimes you overfit 'cause you don't have enough data. 670 01:09:53,012 --> 01:10:00,444 You want to use a big, powerful model, but that big, powerful model just is going to overfit too much on your small dataset. 671 01:10:00,444 --> 01:10:05,909 Regularization is one way to combat that, but another way is through using transfer learning. 672 01:10:05,909 --> 01:10:12,730 Transfer learning kind of busts this myth that you don't need a huge amount of data in order to train a CNN. 673 01:10:12,730 --> 01:10:15,300 The idea is really simple. 674 01:10:15,300 --> 01:10:20,798 You'll maybe first take some CNN. Here is kind of a VGG style architecture. 675 01:10:20,798 --> 01:10:25,031 You'll take your CNN, you'll train it in a very large dataset, like ImageNet, 676 01:10:25,031 --> 01:10:28,039 where you actually have enough data to train the whole network. 677 01:10:28,039 --> 01:10:34,596 Now the idea is that you want to apply the features from this dataset to some small dataset that you care about. 
678 01:10:34,596 --> 01:10:42,864 Maybe instead of classifying the 1,000 ImageNet categories, now you want to classify 10 dog breeds or something like that. You only have a small dataset. 679 01:10:42,864 --> 01:10:45,917 Here, our small dataset only has C classes. 680 01:10:45,917 --> 01:10:58,135 Then what you'll typically do is for this last fully connected layer that is going from the last layer features to the final class scores, this now, you need to reinitialize that matrix randomly. 681 01:10:59,651 --> 01:11:02,952 For ImageNet, it was a 4,096-by-1,000 dimensional matrix. 682 01:11:02,952 --> 01:11:09,182 Now for your new classes, it might be 4,096-by-C or by 10 or whatever. 683 01:11:09,182 --> 01:11:13,985 You reinitialize this last matrix randomly, freeze the weights of all the previous layers 684 01:11:13,985 --> 01:11:21,947 and now just basically train a linear classifier, and only train the parameters of this last layer and let it converge on your data. 685 01:11:23,788 --> 01:11:28,756 This tends to work pretty well if you only have a very small dataset to work with. 686 01:11:28,756 --> 01:11:35,166 Now if you have a little bit more data, another thing you can try is actually fine tuning the whole network. 687 01:11:35,166 --> 01:11:44,935 After that top layer converges and after you learn that last layer for your data, then you can consider actually trying to update the whole network, as well. 688 01:11:44,935 --> 01:11:49,434 If you have more data, then you might consider updating larger parts of the network. 689 01:11:49,434 --> 01:11:56,143 A general strategy here is that when you're updating the network, you want to drop the learning rate from its initial learning rate 690 01:11:56,143 --> 01:12:02,973 because probably the original parameters in this network that converged on ImageNet probably worked pretty well generally, 691 01:12:02,973 --> 01:12:08,605 and you just want to change them a very small amount to tune performance for your dataset. 692 01:12:08,605 --> 01:12:15,490 Then when you're working with transfer learning, you kind of imagine this two-by-two grid of scenarios where on the one side, you have 693 01:12:15,490 --> 01:12:20,113 maybe very small amounts of data for your dataset, or very large amount of data for your dataset. 694 01:12:21,188 --> 01:12:28,780 Then maybe your data is very similar to images. Like, ImageNet has a lot of pictures of animals and plants and stuff like that. 695 01:12:28,780 --> 01:12:35,335 If you want to just classify other types of animals and plants and other types of images like that, then you're in pretty good shape. 696 01:12:35,335 --> 01:12:48,861 Then generally what you do is if your data is very similar to something like ImageNet, if you have a very small amount of data, you can just basically train a linear classifier on top of features, extracted using an ImageNet model. 697 01:12:48,861 --> 01:12:54,786 If you have a little bit more data to work with, then you might imagine fine tuning your data. 698 01:12:54,786 --> 01:12:58,755 However, you sometimes get in trouble if your data looks very different from ImageNet. 699 01:12:58,755 --> 01:13:06,781 Maybe if you're working with maybe medical images that are X-rays or CAT scans or something that looks very different from images in ImageNet, in that case, 700 01:13:06,781 --> 01:13:09,072 you maybe need to get a little bit more creative. 701 01:13:09,072 --> 01:13:14,408 Sometimes it still works well here, but those last layer features might not be so informative. 
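[To make the basic recipe concrete, here is a hedged PyTorch-style sketch of the version just described: freeze the pretrained layers, reinitialize the last fully connected layer, and train only that layer first. It assumes torchvision's VGG16 layout, where the final 4,096-by-1,000 layer is classifier[6]; the number of classes C and the learning rate are placeholders, not values from the lecture.]

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

model = models.vgg16(pretrained=True)        # CNN pretrained on ImageNet

for param in model.parameters():             # freeze all the pretrained weights
    param.requires_grad = False

C = 10                                       # placeholder number of classes in the small dataset
model.classifier[6] = nn.Linear(4096, C)     # reinitialize only the last layer (new params train)

# train just the new last layer; later, if you have more data, unfreeze more layers
# and fine tune the whole network with a lower learning rate than the original training used
optimizer = optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)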
702 01:13:14,408 --> 01:13:21,507 You might consider reinitializing larger parts of the network and getting a little bit more creative and trying more experiments here. 703 01:13:21,507 --> 01:13:29,015 This is somewhat mitigated if you have a large amount of data in your very different dataset 'cause then you can actually fine tune larger parts of the network. 704 01:13:29,015 --> 01:13:32,587 Another point I'd like to make is that this idea of transfer learning is super pervasive. 705 01:13:32,587 --> 01:13:35,660 It's actually the norm, rather than the exception. 706 01:13:35,660 --> 01:13:40,562 As you read computer vision papers, you'll often see system diagrams like this for different tasks. 707 01:13:40,562 --> 01:13:44,706 On the left, we're working with object detection. On the right, we're working with image captioning. 708 01:13:44,706 --> 01:13:48,387 Both of these models have a CNN that's kind of processing the image. 709 01:13:48,387 --> 01:13:53,913 In almost all applications of computer vision these days, most people are not training these things from scratch. 710 01:13:53,913 --> 01:13:59,973 Almost always, that CNN will be pretrained on ImageNet, and then potentially fine tuned for the task at hand. 711 01:13:59,973 --> 01:14:07,089 Also, in the captioning case, sometimes you can actually pretrain some word vectors relating to the language, as well. 712 01:14:07,089 --> 01:14:14,143 You maybe pretrain the CNN on ImageNet, pretrain some word vectors on a large text corpus, and then fine tune the whole thing for your dataset. 713 01:14:14,143 --> 01:14:22,278 Although in the case of captioning, I think this pretraining with word vectors tends to be a little bit less common and a little bit less critical. 714 01:14:22,278 --> 01:14:33,225 The takeaway for your projects, and more generally as you work on different models, is that whenever you have some problem that you want to tackle, but you don't have a large dataset for it, 715 01:14:33,225 --> 01:14:41,673 then what you should generally do is download some pretrained model that's relatively close to the task you care about, and then either 716 01:14:41,673 --> 01:14:44,859 reinitialize parts of that model or fine tune that model for your data. 717 01:14:44,859 --> 01:14:50,676 That tends to work pretty well, even if you have only a modest amount of training data to work with. 718 01:14:50,676 --> 01:14:57,738 Because this is such a common strategy, all of the different deep learning software packages out there provide a model zoo where you can just download 719 01:14:57,738 --> 01:15:01,099 pretrained versions of various models. 720 01:15:01,099 --> 01:15:06,043 In summary today, we talked about optimization, which is about how to improve the training loss. 721 01:15:06,043 --> 01:15:10,884 We talked about regularization, which is improving your performance on the test data. 722 01:15:10,884 --> 01:15:12,838 Model ensembling kind of fit into there. 723 01:15:12,838 --> 01:15:17,440 We also talked about transfer learning, which is how you can actually do better with less data. 724 01:15:17,440 --> 01:15:21,940 These are all super useful strategies. You should use them in your projects and beyond. 725 01:15:21,940 --> 01:15:25,238 Next time, we'll talk more concretely about some of the different deep learning software packages out there.